Audio Similarity Search¶

This notebook applies machine learning techniques to enable similarity search over audio files. The end goal is an effective audio similarity search algorithm that, given a user-selected file, returns similar audio. After reviewing the research, the following approach is attempted:

  • Data Sourcing:
    • ~760 audio files (98 MB on disk) were sourced from a collection of musical instrument samples gathered over time, either from commercial libraries or Freesound.
    • The audio-data folder should be placed in the root directory alongside this .ipynb Jupyter notebook file, and can be downloaded here: https://drive.google.com/file/d/1RnQkkrvUNa-eWl3yDFaYKnV1lJU8fSjv/view?usp=sharing
    • This is a somewhat restricted use case, but it is enough to demonstrate the success or failure of the approach with respect to audio-based similarity search.
    • Audio has been standardized to 22,050 Hz sample rate, 16-bit depth .wav files in order to save disk space. However, the ingestion process can handle many audio file types at any sample rate and bit depth, and itself loads audio at a fixed sample rate.
  • Data Ingestion:
    • Locating and loading audio files: up to the first 2 seconds of each audio file is loaded
      • This is a simplifying step; it would be trivial to change or add sophistication around which 2-second section of audio is loaded (e.g., the middle 2 seconds)
    • Mel Spectrogram image generation for each loaded audio file via librosa, with algorithm hop-size set to spread the spectrogram for the given audio length across the entire image size.
    • Image Normalization (value normalization)
  • Images are fed into a novel Convolutional Autoencoder model in order to train an Encoder to encode the mel spectrogram features into compressed yet meaningful feature vectors.
  • The trained Encoder from the Convolutional Autoencoder can then be employed to enable audio to be searched, wherein the encoded embeddings (feature vectors) are used directly in a simple Cosine Similarity search.
  • It may be possible to achieve even better image reconstruction with transfer learning, wherein a pre-trained network (e.g., VGG-19) could be leveraged and fine-tuned for this purpose.
  • More sophisticated search algorithms, possibly coupled with further dimensionality reduction, may yield still better results.
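The "which 2 seconds" choice in the ingestion step above can be parameterized with a small offset helper. This is a hedged sketch (the `middle_offset` name is my own, and the commented librosa call only mirrors the loader used later in this notebook):

```python
def middle_offset(total_s, window_s=2.0):
    """Start time (seconds) of a window_s-second window centered in a clip of total_s seconds."""
    return max((total_s - window_s) / 2.0, 0.0)

# Feeding this into librosa would look roughly like (illustration only):
#   y, _ = librosa.load(path=path, sr=audio_samplerate,
#                       offset=middle_offset(librosa.get_duration(path=path)),
#                       duration=max_audio_duration)

print(middle_offset(10.0))  # a 2 s window centered in a 10 s clip starts at 4.0 s
print(middle_offset(1.0))   # clips shorter than the window clamp to the start: 0.0
```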

Setup¶

In [2]:
# Python built-ins
import time
import os
import random

# math/vector manipulation and plots
import numpy as np
import matplotlib.pyplot as plt

# image saving
from skimage import io

# audio processing library
import librosa
import librosa.display

# audio playback in notebook
import IPython.display as ipd

# Scikit Learn
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.metrics.pairwise import cosine_similarity

# TensorFlow and Keras
import tensorflow as tf
from tensorflow import keras
#from tensorflow.keras import layers
#from tensorflow.keras.models import Model
#from tensorflow.keras.callbacks import TensorBoard

Global Variables¶

In [3]:
max_audio_duration = 2 # seconds
audio_samplerate = 22050 # Hz
n_fft = 4096 # FFT window length in samples
fmin = 20 # min frequency (Hz) considered in the mel spectrogram
fmax = audio_samplerate // 2 # max frequency (Hz) considered in the mel spectrogram (Nyquist)
input_image_size = (256, 256) # (width, height) pixels
image_value_type = np.uint16 # bit-depth of image, 16-bit seems to provide the best results
image_max_value = np.iinfo(image_value_type).max # used in scaling values 0-1
test_set_size = 0.2 # % of dataset to reserve for test set
trained_encoder_location = 'trained_encoders/ConvEncoder'
training_clip_paths = ['Drums', 'Bass', 'Keys']
inference_clip_paths = ['Drums', 'Bass', 'Keys']

Helper Methods¶

In [4]:
def get_clip_paths(rel_paths):
    """
    Returns a list of full paths for all supported audio files under each path in rel_paths.
    """
    clip_paths = []
    for rel_path in rel_paths:
        audio_wav_dir = f'audio-data/{rel_path}'
        for root, dirs, files in os.walk(audio_wav_dir):
            for file in files:
                path = os.path.join(root, file)
                if path.endswith('.wav'):
                    clip_paths.append(path)
    return clip_paths
In [5]:
def get_audio_samples(path, min_samples):
    """
    Returns the samples for the audio file at path, zero-padded to at least min_samples.
    """
    y, _ = librosa.load(path=path, mono=True, sr=audio_samplerate, offset=0, duration=max_audio_duration)
    # pad if needed, up to the n_fft used in the mel calculations
    if len(y) < min_samples:
        y = np.pad(y, (0, min_samples - len(y)), 'constant')
    return y
In [6]:
def scale_minmax(X, max_value=image_max_value):
    """
    Normalizes values into the range 0 - max_value.
    """
    X_min = X.min()
    X_max = X.max()
    X_range = X_max - X_min
    return ((X - X_min) / X_range) * max_value

def get_mel_spectrogram(y, sr, filename, save_file=False):
    """
    Returns mel spectrogram image data given audio data.
    """
    S = librosa.feature.melspectrogram(y=y,
                                       sr=sr,
                                       n_mels=input_image_size[1],
                                       n_fft=n_fft, 
                                       hop_length=max(int(len(y)/input_image_size[0]), 1),
                                       fmin = fmin,
                                       fmax = fmax)

    # convert the power spectrogram to decibel (dB) scale
    img = librosa.power_to_db(S, ref=np.max).astype(np.float32)
    
    # discard extra spectrogram columns, pad missing ones
    if img.shape[1] > input_image_size[0]:
        img = img[:, :input_image_size[0]]
    elif img.shape[1] < input_image_size[0]:
        img = np.pad(img, [(0,0), (0,input_image_size[0] - img.shape[1])], 'constant')
    
    # scale mel values to image values
    img = scale_minmax(img)
    
    # put low frequencies at the bottom in image (typical human readable format)
    img = np.flip(img, axis=0)
    
    img = img.astype(image_value_type)
    
    # save as PNG
    if save_file:
        print(f'Saving PNG: {filename}')
        io.imsave(filename.replace(".wav", ".png"), img)

    return img

Data Ingestion and Preprocessing¶

In [7]:
def get_data(paths, save_images=False):
    """
    Load data from paths, prepared for training. Optionally save Mel Spectrogram image files.
    """
    audio_paths = get_clip_paths(paths)
    mels = []
    for path in audio_paths:
        samples = get_audio_samples(path, n_fft)
        mel = get_mel_spectrogram(samples, audio_samplerate, path, save_file=save_images)
        # add a color channel, as the model expects this in the input shape, and append to the dataset
        mels.append(mel.reshape(input_image_size[1], input_image_size[0], 1))

    # Scale values to 0-1
    X = np.array(mels)
    X = X / float(image_max_value)
    return audio_paths, X

Mel Spectrogram Examples¶

Each image below is a Mel Spectrogram depicting an audio file. In these images, frequencies are represented on a log scale on the Y axis (lower frequencies on the bottom, higher frequencies on the top), across time on the X axis (scaled to the duration of audio imported, up to 2 seconds).
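The rendered spectrogram images do not survive this export, but the Y-axis behavior described above follows from the mel scale itself. Below is a numpy-only sketch of the HTK-style mel formula; note that librosa defaults to the slightly different Slaney formulation (librosa.hz_to_mel with htk=False), so the numbers are illustrative:

```python
import numpy as np

def hz_to_mel_htk(f_hz):
    """HTK-style mel scale: roughly linear below ~1 kHz, logarithmic above."""
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

# Equal steps in mel correspond to ever-wider steps in Hz at high frequencies,
# which is why the spectrogram Y axis compresses the upper octaves.
for f in (100, 1000, 10000):
    print(f'{f} Hz -> {float(hz_to_mel_htk(f)):.1f} mel')
```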

Train Test Split¶

In [8]:
Y, X = get_data(training_clip_paths, save_images=False)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_set_size, random_state=42)
X_train = np.reshape(X_train, (len(X_train), input_image_size[0], input_image_size[1], 1))
X_test = np.reshape(X_test, (len(X_test), input_image_size[0], input_image_size[1], 1))

Build Convolutional Autoencoder¶

The convolutional autoencoder below is built using the Keras functional API. It consists of a relatively simple Encoder and Decoder made of convolution and deconvolution (upsampling) layers. As configured, the Encoder creates an embedding with shape (16,16,16) which represents 75% memory compression from the input space. The Decoder simply reconstructs the input from the encoded embedding. The model uses Adam optimization and MSE loss.

Many architectures were tested, adjusting the number of filters, kernel size, and activation type, as well as number of layers (how far the compression could go and still be useful). The addition of a Dense (fully connected) layer was also tested at the 'bottom' of the Encoder, as seen in a number of examples in the research. However, this seemed to make training loss far higher, so the inclusion of this layer was quickly abandoned.

In [9]:
input_img = tf.keras.layers.Input(shape=(input_image_size[0],input_image_size[1], 1))

# Encoder
x = tf.keras.layers.Conv2D(128, (3, 3), activation="relu", padding="same")(input_img)
x = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation="relu", padding="same")(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
x = tf.keras.layers.Conv2D(32, (3, 3), activation="relu", padding="same")(x)
x = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)
x = tf.keras.layers.Conv2D(16, (3, 3), activation="relu", padding="same")(x)
encoded = tf.keras.layers.MaxPooling2D((2, 2), padding="same")(x)

# encoded.shape = (16,16,16)
# ~94% value compression (16*16*16)/(256*256)
# ~75% memory compression (16*16*16*4 bytes)/(256*256*1 bytes)

# Decoder
x = tf.keras.layers.Conv2D(16, (3, 3), activation='relu', padding='same')(encoded)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(32, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
x = tf.keras.layers.Conv2D(128, (3, 3), activation='relu', padding='same')(x)
x = tf.keras.layers.UpSampling2D((2, 2))(x)
decoded = tf.keras.layers.Conv2D(1, (3, 3), padding='same')(x) # no activation

# Autoencoder
autoencoder = tf.keras.models.Model(input_img, decoded)
autoencoder.compile(optimizer="adam", loss="mean_squared_error")
autoencoder.summary()
2022-10-20 14:04:50.648237: E tensorflow/stream_executor/cuda/cuda_driver.cc:265] failed call to cuInit: CUDA_ERROR_COMPAT_NOT_SUPPORTED_ON_DEVICE: forward compatibility was attempted on non supported HW
2022-10-20 14:04:50.648361: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: pop-os
2022-10-20 14:04:50.648391: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: pop-os
2022-10-20 14:04:50.648690: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 515.65.1
2022-10-20 14:04:50.648757: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 510.85.2
2022-10-20 14:04:50.648775: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 510.85.2 does not match DSO version 515.65.1 -- cannot find working devices in this configuration
2022-10-20 14:04:50.649404: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_1 (InputLayer)        [(None, 256, 256, 1)]     0         
                                                                 
 conv2d (Conv2D)             (None, 256, 256, 128)     1280      
                                                                 
 max_pooling2d (MaxPooling2D  (None, 128, 128, 128)    0         
 )                                                               
                                                                 
 conv2d_1 (Conv2D)           (None, 128, 128, 64)      73792     
                                                                 
 max_pooling2d_1 (MaxPooling  (None, 64, 64, 64)       0         
 2D)                                                             
                                                                 
 conv2d_2 (Conv2D)           (None, 64, 64, 32)        18464     
                                                                 
 max_pooling2d_2 (MaxPooling  (None, 32, 32, 32)       0         
 2D)                                                             
                                                                 
 conv2d_3 (Conv2D)           (None, 32, 32, 16)        4624      
                                                                 
 max_pooling2d_3 (MaxPooling  (None, 16, 16, 16)       0         
 2D)                                                             
                                                                 
 conv2d_4 (Conv2D)           (None, 16, 16, 16)        2320      
                                                                 
 up_sampling2d (UpSampling2D  (None, 32, 32, 16)       0         
 )                                                               
                                                                 
 conv2d_5 (Conv2D)           (None, 32, 32, 32)        4640      
                                                                 
 up_sampling2d_1 (UpSampling  (None, 64, 64, 32)       0         
 2D)                                                             
                                                                 
 conv2d_6 (Conv2D)           (None, 64, 64, 64)        18496     
                                                                 
 up_sampling2d_2 (UpSampling  (None, 128, 128, 64)     0         
 2D)                                                             
                                                                 
 conv2d_7 (Conv2D)           (None, 128, 128, 128)     73856     
                                                                 
 up_sampling2d_3 (UpSampling  (None, 256, 256, 128)    0         
 2D)                                                             
                                                                 
 conv2d_8 (Conv2D)           (None, 256, 256, 1)       1153      
                                                                 
=================================================================
Total params: 198,625
Trainable params: 198,625
Non-trainable params: 0
_________________________________________________________________

Train Autoencoder¶

This model trains relatively quickly, even on CPU. The model is set to train for only 10 epochs; it seems to converge rapidly, and it is not clear that training further would yield significantly better results. Shuffle is set to True, which shuffles the input dataset at the beginning of each epoch (a mild form of regularization).

Note that X_train is passed as x and y, given this is an autoencoder working with unlabeled data. The inputs are fed to the autoencoder, and the outputs of the autoencoder are compared to the inputs. Keras models also support passing in validation data (X_test in this case), which allows loss evaluation at the end of each epoch – the model is not trained on this data.

In [10]:
autoencoder.fit(X_train, X_train,
                epochs=10,
                batch_size=32,
                shuffle=True,
                validation_data=(X_test, X_test))
2022-10-20 14:04:54.848960: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 160432128 exceeds 10% of free system memory.
Epoch 1/10
2022-10-20 14:04:55.221853: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 160432128 exceeds 10% of free system memory.
2022-10-20 14:04:56.212995: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 1073741824 exceeds 10% of free system memory.
2022-10-20 14:04:56.598244: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 268435456 exceeds 10% of free system memory.
2022-10-20 14:04:56.689577: W tensorflow/core/framework/cpu_allocator_impl.cc:82] Allocation of 134217728 exceeds 10% of free system memory.
20/20 [==============================] - 246s 12s/step - loss: 0.0336 - val_loss: 0.0089
Epoch 2/10
20/20 [==============================] - 180s 9s/step - loss: 0.0077 - val_loss: 0.0066
Epoch 3/10
20/20 [==============================] - 171s 9s/step - loss: 0.0062 - val_loss: 0.0060
Epoch 4/10
20/20 [==============================] - 173s 9s/step - loss: 0.0055 - val_loss: 0.0054
Epoch 5/10
20/20 [==============================] - 171s 9s/step - loss: 0.0051 - val_loss: 0.0051
Epoch 6/10
20/20 [==============================] - 171s 9s/step - loss: 0.0049 - val_loss: 0.0048
Epoch 7/10
20/20 [==============================] - 171s 9s/step - loss: 0.0046 - val_loss: 0.0046
Epoch 8/10
20/20 [==============================] - 171s 9s/step - loss: 0.0044 - val_loss: 0.0045
Epoch 9/10
20/20 [==============================] - 171s 9s/step - loss: 0.0043 - val_loss: 0.0044
Epoch 10/10
20/20 [==============================] - 171s 9s/step - loss: 0.0042 - val_loss: 0.0043
Out[10]:
<keras.callbacks.History at 0x7f328828b4f0>

Evaluate Autoencoder¶

Once trained, we can visually inspect the results of this autoencoder by comparing input images to reconstructed output images. We also visualize the embeddings, and although an embedding itself is not 2-dimensional, some sense of how it represents the input can be ascertained. The reconstructed images reveal the lossy compression of the embeddings, readily seen in the 'fuzziness' or 'blurriness' of the reconstruction with respect to the original input images. However, the features of the input images have been reconstructed in the output, meaning the embeddings capture these features quite well.

In [11]:
n = 4
encoder = tf.keras.models.Model(input_img, encoded)
encoded_imgs = encoder.predict(X_test[:n+1])
decoded_imgs = autoencoder.predict(X_test[:n+1])
plt.figure(figsize=(16, 12))
for i in range(1, n + 1):
    # Display original
    ax = plt.subplot(3, n, i)
    plt.imshow(X_test[i].reshape(input_image_size[1], input_image_size[0]),  aspect='auto')
    plt.gray()
    plt.xticks([])  
    plt.yticks([])  
    ax.set_title(f'Filename: ...{Y_test[i][-20:-4]}')
    ax.set_ylabel('Input')

    # Display Embeddings
    ax = plt.subplot(3, n, i + n)
    plt.imshow(encoded_imgs[i].reshape(64,64), aspect='auto')
    plt.gray()
    plt.xticks([])  
    plt.yticks([]) 
    ax.set_ylabel('Embedding')
    
    # Display reconstruction
    ax = plt.subplot(3, n, i + n + n)
    plt.imshow(decoded_imgs[i].reshape(input_image_size[1], input_image_size[0]), aspect='auto')
    plt.gray()
    plt.xticks([])  
    plt.yticks([]) 
    ax.set_ylabel('Reconstruction')
    
plt.tight_layout()
plt.show()
1/1 [==============================] - 0s 206ms/step
1/1 [==============================] - 3s 3s/step

View PCA projection of Test Set Encoder Representations¶

PCA (Principal Component Analysis) projection can give a sense of the embedding space and how the embeddings map with respect to each other. Below, this is visualized with 2 components (2 dimensions). Using the PCA projection (with 3-9 components) of the embeddings in similarity search (taking cosine similarity of the PCA projection) was also tested, but this seemed to perform worse than using the cosine similarity of the embeddings directly.

In [12]:
num_components = 2
encoded_imgs = encoder.predict(X_test)
embeddings = [np.ravel(e) for e in encoded_imgs]
pca = PCA(n_components=num_components)
pca.fit(embeddings)
pca_proj = pca.transform(embeddings)
print(pca.explained_variance_ratio_)
fig = plt.figure()
ax = fig.add_subplot()
for point in pca_proj:
    ax.scatter(point[0], point[1])
plt.show()
5/5 [==============================] - 3s 540ms/step
[0.51310923 0.16606748]

Save Trained Encoder Model¶

This step isolates the trained Encoder and saves the model so it can be used separately.

In [13]:
model = tf.keras.models.Model(inputs=autoencoder.inputs, outputs=autoencoder.layers[8].output)
model.summary()
model.save(trained_encoder_location)
WARNING:tensorflow:Compiled the loaded model, but the compiled metrics have yet to be built. `model.compile_metrics` will be empty until you train or evaluate the model.
WARNING:absl:Found untraced functions such as _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op, _jit_compiled_convolution_op while saving (showing 4 of 4). These functions will not be directly callable after loading.
INFO:tensorflow:Assets written to: trained_encoders/ConvEncoder/assets

Test Loading Saved Encoder and Create Embeddings¶

This simply tests loading the Encoder model and creating and visualizing embeddings for the test data.

In [14]:
# Load the encoder and time inference on the first n inputs from X_test 
n = 10
encoder = keras.models.load_model(trained_encoder_location)
%timeit encoder.predict(X_test[:n])
encoded_imgs = encoder.predict(X_test)  # %timeit runs in its own scope, so predict again for plotting

plt.figure(figsize=(20, 8))
for i in range(1, n + 1):
    ax = plt.subplot(1, n, i)
    plt.imshow(encoded_imgs[i].reshape((64, 64)))
    plt.gray()
    ax.set_title(Y_test[i][-16:-4])
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
plt.show()
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
1/1 [==============================] - 0s 236ms/step
1/1 [==============================] - 0s 194ms/step
1/1 [==============================] - 0s 186ms/step
1/1 [==============================] - 0s 201ms/step
1/1 [==============================] - 0s 190ms/step
1/1 [==============================] - 0s 196ms/step
1/1 [==============================] - 0s 196ms/step
1/1 [==============================] - 0s 213ms/step
226 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

K-Means Clustering¶

K-means clustering was intended to be used as part of the similarity search; however, it appears that the varying cluster density and spread make this challenging. Perhaps a much larger dataset would make this less of an issue, but this approach was abandoned in favor of using the embeddings directly with a Cosine Similarity matrix, as seen below.

In [15]:
clusters = [4, 8, 16, 32]
scores = []
embeddings = [np.ravel(e) for e in encoded_imgs]
for k in clusters:
    kmeans = KMeans(n_clusters=k, random_state=42).fit(embeddings)
    score = silhouette_score(embeddings, kmeans.labels_)
    print(f'k: {k}, silhouette score: {score}')
    scores.append(score)
k = clusters[np.argmax(scores)]
print(f'Optimal k: {k}')
kmeans = KMeans(n_clusters=k, random_state=42).fit(embeddings)
k: 4, silhouette score: 0.20878012478351593
k: 8, silhouette score: 0.196126326918602
k: 16, silhouette score: 0.17918486893177032
k: 32, silhouette score: 0.18431542813777924
Optimal k: 4

Inference Using Encoder, Similarity Database Searching¶

Below, the Encoder is loaded and used to create embeddings for the entire audio dataset. A Cosine Similarity matrix is then created directly from the embeddings and cached in a 'similarity database' dictionary keyed by audio file path. The get_similar_audio method can then directly retrieve the n_results most similar audio files from the database, based on the highest similarity values for a given audio file.

In [17]:
# load model and get audio data
start = time.time()
encoder = keras.models.load_model(trained_encoder_location)
all_paths, data = get_data(inference_clip_paths)
data = np.reshape(data, (len(data), input_image_size[0], input_image_size[1], 1))
WARNING:tensorflow:No training configuration found in save file, so the model was *not* compiled. Compile it manually.
In [18]:
# get encodings and create similarity (cosine) database
start = time.time()

# create embeddings for all input data
encoded_data = encoder.predict(data) 

encoding_time = round(time.time()-start, 2)
print(f'encoding time: {encoding_time:.2f}s, {encoding_time/len(data):.3f}s per audio file.')

encoded_data = [np.ravel(e) for e in encoded_data]
similarity_matrix = cosine_similarity(encoded_data)
24/24 [==============================] - 15s 596ms/step
encoding time: 14.70s, 0.019s per audio file.
In [19]:
# create similarity database
similarity_database = {}
for P, S in zip(all_paths, similarity_matrix):
    similarity_database[P] = S
In [20]:
def get_similar_audio(path, similarity_database, n_results):
    """
    Retrieve n_results most similar audio files to the given audio file at path from the given similarity_database
    """
    n_results += 1  # fetch one extra result, since the query itself is removed below
    all_paths = list(similarity_database.keys())
    ipd.display(ipd.Audio(path))
    # get similarity results for this audio
    similar = similarity_database[path]
    # get indexes of n_results highest values
    result_indexes = np.argpartition(similar, -n_results)[-n_results:]
    # build dictionary from results of paths and similarity scores
    results = {k:v for (k,v) in zip([all_paths[x] for x in result_indexes], [similar[x] for x in result_indexes])}
    # eliminate self from results
    results = {k:v for k, v in results.items() if k != path}
    # get keys for results by value in reverse sorted order
    sorted_keys = sorted(results, key=results.__getitem__, reverse=True)
    
    #show results
    count = 1
    for k in sorted_keys:
        print(f'Result {count}: {k}, Similarity: {results[k]:.3f}')
        ipd.display(ipd.Audio(k))
        count += 1

Testing Similarity Search¶

Below, the results from similarity search are demonstrated. For each of n_examples we select a random audio file from the database and return the n_results most similar audio files. Using the inline audio players, one can audition the original audio file and the returned results. Running this cell multiple times will produce new examples each time.

In [21]:
# test returning results and check playback
n_examples = 5
n_results = 4
for i in range(n_examples):
    path = random.choice(all_paths) # select a random path from all_paths
    print(f'Finding similar sounds to {path}')
    get_similar_audio(path, similarity_database, n_results)
    print('\n\n-----------------------------------------------------------\n\n')
Finding similar sounds to audio-data/Keys/573628__acollier123__preset-jazz-organ-c.wav
Result 1: audio-data/Keys/573639__acollier123__preset-pearl-drop-c.wav, Similarity: 0.950
Result 2: audio-data/Keys/166009__acollier123__casio-hz600-01-piano-c.wav, Similarity: 0.929
Result 3: audio-data/Bass/110518__nandoo1__nandoo-messany-horror-bell.wav, Similarity: 0.925
Result 4: audio-data/Keys/573633__acollier123__preset-synphonic-ens-c.wav, Similarity: 0.925

-----------------------------------------------------------


Finding similar sounds to audio-data/Bass/320046__staticpony1__analog-bass-vel-2.wav
Result 1: audio-data/Bass/320047__staticpony1__analog-bass-vel-1.wav, Similarity: 0.997
Result 2: audio-data/Drums/25641__walter-odington__hot-rod-kick.wav, Similarity: 0.978
Result 3: audio-data/Drums/25642__walter-odington__krusty-kick.wav, Similarity: 0.977
Result 4: audio-data/Drums/25650__walter-odington__super-pulse-kick.wav, Similarity: 0.976

-----------------------------------------------------------


Finding similar sounds to audio-data/Drums/183096__dwsd__bd-dust808.wav
Result 1: audio-data/Bass/331480__staticpony1__analog-bass-vel-3.wav, Similarity: 0.957
Result 2: audio-data/Bass/331485__staticpony1__analog-bass-vel-6.wav, Similarity: 0.954
Result 3: audio-data/Drums/183115__dwsd__prc-appet909.wav, Similarity: 0.953
Result 4: audio-data/Drums/183124__dwsd__prc-dust808tomlow.wav, Similarity: 0.946

-----------------------------------------------------------


Finding similar sounds to audio-data/Keys/573643__acollier123__preset-synth-celesta-c.wav
Result 1: audio-data/Keys/573627__acollier123__preset-brass-ensemble.wav, Similarity: 0.987
Result 2: audio-data/Keys/573640__acollier123__preset-synth-clavi-c.wav, Similarity: 0.970
Result 3: audio-data/Keys/573629__acollier123__preset-harpsichord-c.wav, Similarity: 0.969
Result 4: audio-data/Keys/573630__acollier123__preset-blues-harmonica-c.wav, Similarity: 0.960

-----------------------------------------------------------


Finding similar sounds to audio-data/Drums/270137__theriavirra__02-ride-long-cymbals-snares.wav
Result 1: audio-data/Drums/269967__theriavirra__01-snare-aftershot-smooth-cymbals-snares.wav, Similarity: 0.980
Result 2: audio-data/Drums/270100__theriavirra__snare-sample-2015-10-american-punch.wav, Similarity: 0.978
Result 3: audio-data/Drums/269904__theriavirra__01b-snare-smooth-cymbals-snares.wav, Similarity: 0.977
Result 4: audio-data/Drums/270158__theriavirra__04-snare-smooth-cymbals-snares.wav, Similarity: 0.976

-----------------------------------------------------------


Results¶

The results demonstrated above seem quite good. The resulting Encoder model and similarity search based on a Cosine Similarity matrix would be eminently useful in the context of an audio sample browser or audio production application, where producers frequently need to quickly find audio similar to a given sample. There are, however, a few returned results which do not match the searched audio very well.

A more sophisticated approach, taking into account multiple discrete aspects of auditory perception (using different audio features with more models in an ensemble approach), may lead to more robust results. Further research points to a Convolutional LSTM model that could take into account the relationship of frequency over time, although the simpler Convolutional approach above does capture some sense of time, given that the input images' x-axis (width) does, in fact, represent time. Simpler features could be incorporated relatively easily as well, such as RMS loudness, average pitch class (key centers vs. atonal), and more.
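As one concrete example of the "simpler features" idea, RMS loudness is nearly a one-liner. A plain-numpy sketch (librosa also provides a framed version, librosa.feature.rms, which would be the natural fit in this pipeline):

```python
import numpy as np

def rms_loudness(y):
    """Root-mean-square level of an audio buffer (linear scale, not dB)."""
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean(y ** 2)))

# sanity check: a full-scale sine wave has RMS 1/sqrt(2) ~= 0.707
t = np.linspace(0, 1, 22050, endpoint=False)
sine = np.sin(2 * np.pi * 440.0 * t)
print(round(rms_loudness(sine), 3))  # 0.707
```

A scalar like this could be appended to (or used to re-rank) the cosine-similarity results above at negligible cost.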

This was a very interesting project, and I look forward to continuing research in this direction.

In [ ]: